feat(graphrag,mcp,vectordb,storage): tenant isolation rollup (RAN-37, RAN-38, RAN-39, RAN-21, RAN-20)#28
Conversation
Adds tenant_id to the three persisted GraphRAG tables (investigations, graph_snapshots, drain_templates), with a one-shot idempotent backfill so existing rows fall back to DefaultTenantID and a driver-specific composite PK promotion for drain_templates on SQLite/PostgreSQL. - schema.go: TenantID first on DrainTemplateRow; PK becomes (tenant_id, id). - investigation.go: TenantID + idx_investigations_tenant_created composite index; cooldown key now scoped by tenant; GetInvestigations / Get-by-id take a context and filter by storage.TenantFromContext. - snapshot.go: TenantID + idx_graph_snapshots_tenant_created; GetGraphSnapshot is ctx-scoped. takeSnapshot continues to fan out per tenant. - drain.go: SaveDrainTemplates / LoadDrainTemplates now take a tenant string; OnConflict targets composite (tenant_id, id) so the same template hash can coexist across tenants for the eventual per-tenant Drain miner. - migrate.go (new): AutoMigrateGraphRAG runs AutoMigrate, idempotent UPDATE backfill, and a SQLite (rename/recreate/copy) + PostgreSQL (DROP/ADD CONSTRAINT) PK promotion path. MySQL/MSSQL skipped with a log. - mcp/tools.go: thread the dispatcher ctx into get_investigations, get_investigation, get_graph_snapshot so the new ctx-aware signatures compile end-to-end. Subtask C (RAN-39) wires the request tenant. - builder.go / refresh.go: Drain Save/Load callers pin storage.DefaultTenantID for the single shared miner. Tests: - TestAutoMigrateGraphRAG_CreatesTenantCompositeIndexes - TestAutoMigrateGraphRAG_DrainTemplatesCompositePK - TestAutoMigrateGraphRAG_IsIdempotent - TestAutoMigrateGraphRAG_BackfillsLegacyRows - TestSaveLoadDrainTemplates_TenantIsolation (collision across tenants) - TestGraphRAG_GetInvestigations_TenantScoped (cross-tenant id-guess miss) - TestGraphRAG_GetGraphSnapshot_TenantScoped - TestCooldownKey_TenantIsolated Verification: go vet ./... and go test ./... — 281 tests pass across 26 packages. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-Authored-By: Paperclip <noreply@paperclip.ing>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 62768be8ef
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if err := tx.Exec(`ALTER TABLE drain_templates RENAME TO drain_templates__legacy`).Error; err != nil { | ||
| return fmt.Errorf("rename legacy: %w", err) | ||
| } | ||
| // Recreate with the current GORM schema (composite PK + indexes). | ||
| if err := tx.Migrator().CreateTable(&DrainTemplateRow{}); err != nil { |
There was a problem hiding this comment.
Drop legacy indexes before creating new drain_templates table
On SQLite upgrades from the pre-RAN-38 schema, ALTER TABLE ... RENAME TO drain_templates__legacy keeps legacy index names (for first_seen/last_seen) attached to the renamed table. Creating the new table immediately afterward tries to create indexes with the same names and fails with index ... already exists, so PK promotion never completes; because writes now use ON CONFLICT (tenant_id, id), subsequent SaveDrainTemplates calls fail on those upgraded databases.
Useful? React with 👍 / 👎.
…n test (RAN-39)
Threads the tenant resolved by the MCP transport (X-Tenant-ID header → ctx)
into every GraphRAG-backed tool handler and adds the merge-gate integration
test that asserts cross-tenant isolation for the full GraphRAG-backed MCP
surface.
mcp/tools.go
- get_system_graph and get_service_health now accept ctx and route through
GraphRAG.ServiceMap / AllServiceEdges so they pick up RAN-37's per-tenant
in-memory partitioning. Legacy svcGraph remains as a fallback path only
when GraphRAG isn't wired (boot windows, future test harnesses).
- All other GraphRAG handlers were already ctx-threaded after RAN-37/38;
no behavior change for those.
vectordb/index.go (+ main.go, api/similar_handler.go)
- vectordb.Index.Add and Search now take a tenant string; LogVector and
SearchResult carry the tenant tag and Search filters by it. RAN-37
already added tenant args at the call sites in graphrag/clustering.go
and mcp/tools.go but the matching vectordb signature change had not
landed, leaving the branch unbuildable. This closes that gap with the
smallest surgical change; the broader vectordb rework remains RAN-20.
- main.go now passes l.TenantID into vectorIdx.Add on hydration and the
live ingest hook; api/similar_handler resolves tenant from the request
context before searching.
graphrag/builder.go
- New RegisterAnomaly(tenant, AnomalyNode) — small public API symmetric
with PersistInvestigation, used by the new isolation test to seed
per-tenant anomalies without depending on the throttled detector loop.
mcp/tenant_isolation_test.go
- Stands up an in-process MCP server (httptest) wired to GraphRAG over
in-memory SQLite, seeds three tenants (acme, beta, default) with
overlapping service_name / trace_id / span_id / Drain template / log
body / snapshot, and exercises every GraphRAG-backed tool — get_service_map,
get_service_health, get_error_chains, trace_graph, impact_analysis,
root_cause_analysis, correlated_signals, get_anomaly_timeline,
get_investigations, get_investigation (own + cross-tenant id-guess),
get_graph_snapshot, find_similar_logs, get_system_graph — three times
each (X-Tenant-ID acme, X-Tenant-ID beta, no header → DefaultTenantID).
Each response is scanned for the caller's own tenant marker and for any
other seeded tenant's marker (service name, log body, op name, anomaly
evidence, snapshot id) to prove no cross-tenant leak.
Verified: go vet ./... clean; go test ./... clean; go test -race
./internal/{mcp,graphrag}/... clean.
Co-Authored-By: Paperclip <noreply@paperclip.ing>
Replace standalone uniqueIndex on Trace.TraceID with composite idx_traces_tenant_trace_id(tenant_id, trace_id) so two tenants can ingest the same trace_id without either dropping into OnConflict.DoNothing. AutoMigrateModels now idempotently discovers and drops any legacy single-column unique index on traces.trace_id (SQLite/Postgres/MySQL/MSSQL) so upgraded databases stop silently blocking cross-tenant reuse. Co-Authored-By: Paperclip <noreply@paperclip.ing>
…isolation Reviewer (cf5145d8) requested three changes on commit c839460. Addresses each with verified failure-on-regression checks. #1 — Cross-tenant boot hydration of vector index Previously main.go hydrated the index via repo.GetLogsV2(appCtx, ...) which is tenant-scoped to whatever tenant ctx carries — appCtx has none, so only the default tenant's rows reloaded after restart and find_similar_logs was cold for every other tenant until fresh ERROR logs landed. The new tenant-aware vectorIdx.Add(..., l.TenantID, ...) didn't fix it because the non-default rows were never fetched. - internal/storage/log_repo.go: new ListRecentHighSeverityLogsAllTenants — explicitly cross-tenant administrative read used only by hydration. Each row carries its own TenantID; fan-out happens in the caller. - main.go: hydration now uses the new method so every tenant's warm index survives a restart. - internal/vectordb/index.go: tenant-aware FIFO eviction. At cap, drop up to maxSize/10 of the inserting tenant's oldest rows so a noisy tenant cannot evict another tenant's warm rows (availability isolation; confidentiality is still enforced by doc.Tenant filtering in Search). Brand-new tenants drop one globally-oldest row to claim a first slot. #2 — root_cause_analysis assertion was vacuous internal/mcp/tenant_isolation_test.go:434 passed an empty ownMarker to assertNoLeak, so the merge gate would still pass if the tool regressed to returning [] for every tenant. Now passes ownService (RankedCause carries Service so it must appear in a non-empty response). Verified by sabotaging the handler to return a nil slice — the assertion fails with 'expected own marker "acme-orders" in response, body=null' as designed, then reverted. #3 — Drain cluster-id test didn't actually compare cluster IDs The previous test reused seedTenant (per-tenant service names) and only scanned response text, so a regression that surfaced the same cluster row across tenants would still pass. Rewritten to: - Use one shared service name across both tenants so Drain produces colliding (service, templateID) keys — the SignalStore partition is the only thing keeping rows separate. - Inspect the actual []graphrag.LogClusterNode returned by CorrelatedSignals (not just response text), checking Template + SampleLog content for own-marker presence and foreign-marker absence. - Log the per-tenant cluster IDs so future refactors that change the ID scheme leave a visible audit trail. - End-to-end probe via the MCP HTTP surface remains, asserting the same isolation reaches clients. RAN-20 supporting tests internal/vectordb/index_test.go, internal/api/similar_handler_test.go, internal/mcp/tools_ran20_test.go cover vectordb tenant scoping at three layers (in-memory, REST handler, MCP tool). They were sitting untracked on the branch from the parallel RAN-20 work; bundling them so the follow-up vectordb behavior added here is covered. Verified: - go vet ./... clean - go test ./... clean (full repo) - go test -race ./internal/{mcp,graphrag,vectordb,api}/... clean Co-Authored-By: Paperclip <noreply@paperclip.ing>
Tightens the docstring to spell out that results are scoped to the tenant on r.Context() (set by TenantMiddleware from X-Tenant-ID) and cross-tenant rows are never returned. Behavior unchanged. Co-Authored-By: Paperclip <noreply@paperclip.ing>
|



Summary
Multi-tenancy isolation across GraphRAG persistence + in-memory + vector index + trace identity, plus tenant-scoped MCP tools. Implements RAN-37, RAN-38, RAN-39, RAN-21, and RAN-20 on a single branch because they share a common code surface and were sequenced as a unit. Each issue is documented inline below; commits are atomic per issue.
Issues addressed
ServiceStore,TraceStore,SignalStore,AnomalyStorekeyed by tenant; query ctx threading).tenant_idcolumn + scoped reads + idempotent backfill on the three persisted GraphRAG tables (investigations,graph_snapshots,drain_templates).trace_iduniqueness so the same trace ID can appear across tenants without collision.vectordb.Indexfilter-on-read + tenant-aware FIFO eviction; cross-tenant startup hydration helper; HTTP + MCP integration tests including FIFO eviction-fairness assertion.RAN-38 — GraphRAG persistence
investigationsandgraph_snapshotsget(tenant_id, created_at)composite indexes.drain_templatesPK is promoted to composite(tenant_id, id)so the same Drain template hash can coexist across tenants once a per-tenant miner ships.GetInvestigations,GetInvestigation, andGetGraphSnapshotnow take acontext.Contextand filter bystorage.TenantFromContext. The MCP dispatcher already had actxin scope and threads it through; cross-tenant id-guess lookups returnErrRecordNotFound.PersistInvestigationwrites its tenant onto the row;takeSnapshotForTenantdoes the same;SaveDrainTemplates/LoadDrainTemplatesnow take atenantparameter and target the composite PK inOnConflict.investigationCooldown's key includes tenant so an error in tenant-A doesn't suppress the same pattern in tenant-B.AutoMigrateGraphRAGruns three passes — AutoMigrate → idempotentUPDATE … WHERE tenant_id IS NULL OR ''backfill across all three tables → driver-specific drain_templates PK promotion. SQLite uses the canonical rename/recreate/copy/drop recipe; PostgreSQL usesALTER TABLE … DROP CONSTRAINT IF EXISTS … ADD PRIMARY KEY (tenant_id, id). MySQL/MSSQL skip with a logged warn so an operator can apply equivalent DDL by hand.RAN-20 — semantic log search tenant isolation
vectordb.Indexnow recordsTenantperLogVectorand filters on read inSearch(tenant, q, k). Empty tenant coerces to platform default (mirrorsstorage.DefaultTenantID).Adddrops up tomaxSize/10oldest entries belonging to the inserting tenant only — a noisy tenant can no longer push another tenant's warm rows out. Brand-new tenant on a full index drops at most one globally-oldest row as a one-time fairness fallback.repo.ListRecentHighSeverityLogsAllTenants(cross-tenant by design, not exposed on any tenant-scoped API surface). Closes the silent failure whereGetLogsV2-based hydration only repopulated the default tenant's index on restart.internal/api/similar_handler.goresolves tenant viastorage.TenantFromContext(r.Context());internal/mcp/tools.go::toolFindSimilarLogsresolves viamcpCtx(ctx) + TenantFromContext.internal/vectordb/index_test.go(5 tests incl.TestFIFOEvictionFairness),internal/api/similar_handler_test.go(HTTP two-tenant integration via realTenantMiddleware),internal/mcp/tools_ran20_test.go(MCP two-tenant + no-tenant-fallback).RAN-21 — trace_id tenant scoping
Promotes the
traces.trace_idunique constraint to(tenant_id, trace_id)so concurrent tenants can ingest overlapping trace IDs without collision. Migration handles SQLite, Postgres; warns on MySQL/MSSQL.RAN-39 — MCP tenant ctx + isolation test
Threads
r.Context()end-to-end through every GraphRAG-backed MCP tool (get_error_chains,trace_graph,impact_analysis,root_cause_analysis,correlated_signals,get_anomaly_timeline,get_service_map,get_investigations,get_investigation,get_graph_snapshot). Addsinternal/mcp/tenant_isolation_test.goas the merge-gate two-tenant integration test.RAN-37 — store partitioning
Introduces
storesFor(ctx)— per-tenantServiceStore/TraceStore/SignalStore/AnomalyStoreresolution. All query helpers take ctx; OnSpan/Log/MetricIngested fan into the right tenant's stores via the ingest tenant resolution chain.Out of scope
g.drainstays single-instance keyed underDefaultTenantID. Tracked separately.TenantPartitioned[K, V]abstraction — TechLead and I converged on ad-hoc maps for this PR; extraction to a follow-up if shapes converge cleanly post-merge.Test plan
go vet ./...cleango build ./...cleango test ./... -count=1— full suite passesgo test -race -count=1 ./internal/{mcp,graphrag,vectordb,api,storage}/...cleantenant_isolation_test.goTestFIFOEvictionFairness(tenant B flood does not evict tenant A canaries) + HTTP/MCP two-tenant integration🤖 Generated with Claude Code